ABSTRACT



Recent years have witnessed an unprecedented proliferation of so- 
cial media. People around the globe author, every day, millions 
of blog posts, micro-blog posts, social network status updates, etc. 
This rich stream of information can be used to identify, on an ongo- 
ing basis, emerging stories, and events that capture popular atten- 
tion. Stories can be identified via groups of tightly-coupled real- 
world entities, namely the people, locations, products, etc., that are 
involved in the story. The sheer scale, and rapid evolution of the 
data involved necessitate highly efficient techniques for identifying 
important stories at every point of time. 

The main challenge in real-time story identification is the main- 
tenance of dense subgraphs (corresponding to groups of tightly- 
coupled entities) under streaming edge weight updates (resulting 
from a stream of user-generated content). This is the first work 
to study the efficient maintenance of dense subgraphs under such 
streaming edge weight updates. For a wide range of definitions 
of density, we derive theoretical results regarding the magnitude 
of change that a single edge weight update can cause. Based on 
these, we propose a novel algorithm, DynDens, which outper- 
forms adaptations of existing techniques to this setting, and yields 
meaningful results. Our approach is validated by a thorough exper- 
imental evaluation on large-scale real and synthetic datasets. 

1. INTRODUCTION 

Recent years have witnessed an unprecedented proliferation of 
social media. Millions of people around the globe author on a daily 
basis millions of blog posts, micro-blog posts and social network 
status updates. This content offers an uncensored window into cur- 
rent events, and emerging stories capturing popular attention. 

For instance, consider the U.S. military strike in Abbottabad,
Pakistan in early May 2011, which resulted in the death of Osama
bin Laden. This event was extensively covered on Twitter, the popular
micro-blogging service, significantly in advance of traditional
media, starting with the live coverage of the operation by an (unwitting)
local witness, to millions of tweets around the world providing
a multifaceted commentary on every aspect of the story. Similar, if
fewer, online discussions cover important events on an everyday
basis, from politics and sports, to the economy and culture (notable
examples from recent years range from the death of Michael
Jackson, to revolutions in the Middle East and the economic recession).
In all cases, stories have a strong temporal component,
making timeliness a prime concern in their identification.

Figure 1: Real-time identification of "bin Laden raid" story, and connection to ENGAGEMENT

Interestingly, such stories can be identified by leveraging the 
real-world entities involved in them (e.g. people, politicians, prod- 
ucts and locations) [26]. The key observation is that each post on 
the story will tend to mention the same set of entities, around which 
the story is centered. In particular, as post length restrictions or 
conventions typically limit the number of entities mentioned in a 
single post, each post will tend to mention entities corresponding 
to a single facet of a story. Thus, by identifying pairs of entities 
that are strongly associated (recurrently mentioned together), one 
can implicitly detect facets of the underlying event of which they 
are the main actors. By piecing together these aspects, the overall 
event of interest can be inferred. 

For example, in the case of the U.S. military strike mentioned 
above, one facet, consisting of people discussing the raid, is cen- 
tered around "Abbottabad" where the raid took place, and the in- 
volvement of the "C.I.A."; another thread commenting on the pres- 
idential announcement, involves "Barack Obama" and "Osama bin 
Laden"; and so on. The resulting overall story at some point of
time involves the union of these entities. Such sets of entities can 
be then used by users of systems such as Grapevine [3] to enable 
the interactive exploration of the story. 

Given a measure to quantify the strength of association between 
two entities (such as the log-likelihood ratio [26], the $\chi^2$ measure,
or the correlation coefficient [5], etc.), one can abstract the real-time
stream of posts as giving rise to an evolving (weighted) entity
graph, whose edges denote the pairwise entity association strength¹. An important
story can then be identified via a cohesive group of strongly
associated entity pairs; i.e. a dense subgraph in the entity graph, 
given an appropriate definition of density. Moreover, note that, as 
the entities in a story need to be presented to users to facilitate nav- 
igation, story cardinality needs to be constrained to moderate sizes; 
after all, it would not be very interesting or helpful to present users 
with a story centered around 100 main entities. This process is 
illustrated in Figure 1.

Every post that is published, results in the weight update of one 
or more edges in the entity graph. The high frequency of post 
generation, coupled with our need for timely reporting of emerg- 
ing stories, necessitates that the identification of dense structures in 
the entity graph be highly efficient. This work thus addresses the 
problem of dENse subGrAph maintenance for edGE-weight update
streaMs under sizE constraiNTs, or ENGAGEMENT for brevity. Besides
being useful as-is for identifying stories from social media in
real-time, solutions to this problem can also be used as building 
blocks for more complex computations; e.g. identified dense sub- 
graphs can undergo diversification before being presented to the 
user [2], or they can be reranked taking their external sparsity into 
account, in order to identify (soft) clusters of associated entities. 

Addressing ENGAGEMENT at web scales presents several challenges.
Principal among these is that a change in the weight of
a single edge can impact the density of many subgraphs, necessitating
a potentially unbounded exploration of the entity graph.
Thus, any efficient solution to ENGAGEMENT needs to incrementally
maintain dense subgraphs, without recomputing them from
scratch. Moreover, there does not exist a single definition of graph
density suitable for all scenarios; selecting the most appropriate 
definition for a given setting depends, for instance, on the perceived 
relative importance of having large, versus well-connected, dense 
subgraphs. Thus, solutions to ENGAGEMENT need to be applicable 
under general notions of density; however, existing techniques are 
only applicable to limited subsets of this problem. 

In this context, in this work we propose DynDens, an efficient 
algorithm for ENGAGEMENT. We theoretically quantify the magni- 
tude of change in dense subgraphs that a single edge weight update 
can cause. Based on this, we show how maintaining some sparse 
subgraphs, in addition to dense ones, enables the incremental main- 
tenance of dense subgraphs. The resulting algorithm, DynDens, 
makes use of an efficient index for subgraphs, which decreases 
memory consumption and processing effort. It is complemented
by theoretically sound heuristics that can offer improved performance.
A comprehensive experimental evaluation on real and synthetic
data highlights the effectiveness of our approach.

To summarize, our main contributions in this work are: 
i) Motivated by the need to identify emerging stories in real- 
time, for a wide range of measures of entity association, we for- 
malize the problem of dENse subGrAph maintenance for edGE- 
weight update streaMs under sizE constraiNTs (ENGAGEMENT), 
for a very broad notion of graph density. 

ii) We propose an efficient algorithm DynDens, based on a
novel quantification of the maximum possible change caused by
a single edge weight update. By maintaining a small number of
sparse subgraphs, DynDens is able to efficiently and incrementally
compute dense subgraphs.

iii) We design an efficient dense subgraph index, which decreases 
memory consumption and processing effort, and propose theoreti- 
cally sound heuristics for DynDens that can offer improved per- 
formance. 

iv) We validate our techniques via a thorough experimental eval- 
uation on both real and synthetic datasets. 

The remainder of this paper is organized as follows: After pro- 
viding a formal problem statement in Section 2, we present our 
proposed algorithm DynDens in Section 3. We explore the theoretical
basis for DynDens in Section 4, evaluate the proposed
techniques in Section 5, and discuss some improvements to DynDens
in Section 6. Finally, we review related work in Section 7,
and conclude in Section 8.

2. FORMALIZATION 

Let us now turn to defining ENGAGEMENT. At a high level, let 
us consider a weighted graph, with a constant number of vertices. 
At every discrete time interval, the weights of one or more edges 
are adjusted (including potentially edge additions and removals). 
The goal is to maintain, at each point of time, all subgraphs with 
"density" greater than a given threshold. 

Connections to real-time story identification: Before fully 
formalizing the problem, let us first draw some connections to its 
application in real-time story identification. In this context, ver- 
tices correspond to real-world entities, and edge weights to their 
(current) pairwise association strengths (the choice of association 
strength measure will depend on characteristics of the specific prob- 
lem instance; in Section 5 we discuss several such choices). We 
assume that a procedure exists for processing streams of (entity-annotated²)
posts, and generating the appropriate edge weight updates
at each time interval (in Section 5 we discuss such procedures
for a variety of measures of interest).

Data model: We represent the problem domain as i) a complete
weighted graph $G = (V, E)$ with $N$ vertices, where $w_{ij}$ is the
weight of the edge between nodes $i$ and $j$; and ii) a stream of edge
weight updates of the form $update_i = (a, b, \delta)$, signifying that
at time instant $i$, the weight of the edge between vertices $a$ and $b$
changed from $w_{ab}$ to $w_{ab} + \delta$.

Density: We define subgraph density as follows: for every subgraph
$C \subseteq V$, its density is $dens(C) = \frac{score(C)}{S_{|C|}}$, where
$score(C) = \sum_{i,j \in C,\, i<j} w_{ij}$. $S_n$ is a function quantifying the relative
importance of a subgraph's cardinality, $n$, to its density; with the
appropriate choice of $S_n$, virtually all quantifications of graph density
can be represented.

Note that we do not consider counter-intuitive quantifications of
graph density, such as (but not limited to) a definition of density
where the removal of a vertex from an unweighted clique results in
an increase of its density. To safeguard against such quantifications
of density, we require that $S_n$ have the following intuitive monotonicity
properties: $\frac{n}{n-1} \le \frac{S_n}{S_{n-1}} \le \frac{n}{n-2}$.³ This encompasses
the full spectrum of choices of density functions commonly used
in the literature; typical choices include $S_n = \frac{n(n-1)}{2}$ (thus density
is defined as the average edge weight, favoring small, dense
subgraphs; we term this instantiation AvgWeight), and $S_n = n$
(thus density represents a generalized average node "degree", favoring
large subgraphs; we term this case AvgDegree).

¹ The association measure can also incorporate notions of recency
of association, e.g. by including some form of temporal decay.

² The precise procedure used for identifying named entities in documents,
e.g. [3], is orthogonal to this work.

³ Moreover, observe that if $\frac{S_n}{S_{n-1}} < \frac{n}{n-1}$, the density of an unweighted
clique $K_3$ will increase if it is augmented by a single vertex,
connected with a single edge to one of the clique vertices.

Cardinality constraint: Finally, let $N_{max}$ be a (user-specified)
maximum cardinality for subgraphs of interest. (In the context of
real-time story identification, this constraint ensures that any subgraphs
identified are small enough to be used for navigation/exploration
purposes; cf. Section 1.)

ENGAGEMENT: Given the above, the goal of ENGAGEMENT is
to maintain, at every point of time $i$, the subgraphs (vertex subsets)
with density of at least a given threshold $T$, subject to cardinality constraints,
i.e. $\{V_j \mid V_j \subseteq V \wedge dens(V_j) \ge T \wedge |V_j| \le N_{max}\}$. We
term these output-dense subgraphs.
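
As a point of reference, the problem statement can be written down as a brute-force enumeration. The following is only a minimal sketch of the definition, not of DynDens: it recomputes everything from scratch, which is exactly what an incremental algorithm is meant to avoid. The function names, the dictionary-based edge representation, the inclusive threshold test, and the toy graph are all assumptions made for illustration; subsets of fewer than two vertices are skipped because $S_1 = 0$ under AvgWeight.

```python
from itertools import combinations


def s_avg_weight(n):
    """S_n for AvgWeight: the number of vertex pairs among n vertices."""
    return n * (n - 1) / 2


def s_avg_degree(n):
    """S_n for AvgDegree."""
    return n


def score(weights, subset):
    """Sum of w_ij over all pairs i < j in subset; absent edges count as 0."""
    return sum(weights.get(pair, 0.0) for pair in combinations(sorted(subset), 2))


def dens(weights, subset, s_n):
    """dens(C) = score(C) / S_|C|."""
    return score(weights, subset) / s_n(len(subset))


def output_dense(weights, vertices, threshold, n_max, s_n):
    """Every vertex subset with 2 <= |C| <= n_max and dens(C) >= threshold."""
    result = []
    for size in range(2, n_max + 1):
        for subset in combinations(sorted(vertices), size):
            if dens(weights, subset, s_n) >= threshold:
                result.append(subset)
    return result


# Hypothetical toy graph; keys are (i, j) with i < j.
weights = {(1, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0, (3, 4): 0.4}
print(output_dense(weights, [1, 2, 3, 4], threshold=1.0, n_max=3, s_n=s_avg_weight))
print(output_dense(weights, [1, 2, 3, 4], threshold=1.0, n_max=3, s_n=s_avg_degree))
```

On this toy graph, AvgWeight reports the three heavy edges as well as the triangle they form, whereas AvgDegree reports only the triangle, which reflects the stated preference of AvgDegree for larger subgraphs.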

Notation: Before going into the details of our proposed ap- 
proach, let us introduce some useful notation. 

We denote each vertex by a natural number, so $V = \{1, \dots, N\}$
denotes the set of vertices in $G$.

Let $\vec{e}_i$ be the $i$-th basis vector (an $N$-dimensional vector, with
value 1 in its $i$-th coordinate, and 0 elsewhere). We will denote a
subset $C \subseteq V$ by its corresponding vector $\vec{c} = \sum_{i \in C} \vec{e}_i$, and will
sometimes refer to either interchangeably; we will also on occasion
denote the cardinality of subset $C$ as $|\vec{c}|$.

Let $\vec{\Gamma}_u$ be the neighborhood vector of vertex $u$:
$\vec{\Gamma}_u = (w_{1u}, w_{2u}, \dots, w_{Nu})$.

For convenience, we will also make use of the following normalized
version of $S_n$: let $g_n = \frac{S_n}{n(n-1)}$. By the monotonicity
properties of $S_n$, it follows that $g_n \le g_{n-1}$.
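
The step from the monotonicity requirement to the ordering of the normalized values can be spelled out as follows. This is a sketch under two assumptions: that the normalization is $g_n = S_n / (n(n-1))$, as listed in Table 2, and that the upper monotonicity bound reads $S_n / S_{n-1} \le n/(n-2)$; it applies for $n \ge 3$.

```latex
\frac{g_n}{g_{n-1}}
  = \frac{S_n}{n(n-1)} \cdot \frac{(n-1)(n-2)}{S_{n-1}}
  = \frac{S_n}{S_{n-1}} \cdot \frac{n-2}{n}
  \le \frac{n}{n-2} \cdot \frac{n-2}{n}
  = 1
\quad\Longrightarrow\quad g_n \le g_{n-1}.

% Two instances:
%   AvgWeight: S_n = n(n-1)/2  gives  g_n = 1/2        (constant, so equality holds)
%   AvgDegree: S_n = n         gives  g_n = 1/(n-1)    (strictly decreasing)
```

The AvgWeight case shows why the inequality cannot be strict in general.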

Unless explicitly stated, we will focus on the time instant where
the weight of the edge between vertices $a$ and $b$ is updated from
$w_{ab} = w$ to $w + \delta$. Whenever a quantity $X$ can be affected by
this update, we will denote its value before the update as $X^-$ and
its value after the update as $X^+$. We omit this superscript when
it does not affect results in any way. For example, $w_{ab}^- = w$,
$w_{ab}^+ = w + \delta$.

3. THE DYNDENS APPROACH

Let us now discuss how our proposed algorithm, DynDens, 
identifies, at every point of time, all output-dense subgraphs. 

Dense subgraphs and growth property: Observe that there is
an inherent tradeoff in the set of subgraphs that DynDens will
maintain, which we term "dense" subgraphs. At one extreme, DynDens
could opt to maintain only output-dense subgraphs, with the
other extreme being to maintain all subgraphs. However, neither
of these is desirable: the former because it does not enable incremental
computation of output-dense subgraphs, the latter due to
its prohibitive costs. We will subsequently (Section 4.2) formally
quantify this tradeoff. For now, loosely speaking, we will say that
$C$ is a dense subgraph iff it has density of at least a given threshold
$T_{|C|}$ (which is a function of the cardinality of $C$), and cardinality
of at most $N_{max}$ (for a complete list of density-related terms
used in this work, cf. Table 1). $T_n$ is defined in a manner that ensures
that every dense graph with $n$ vertices has at least one dense
subgraph with $n-1$ vertices (thus it is possible to identify all dense
subgraphs by "growing" dense subgraphs of smaller cardinalities).
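
The growth property can be checked by brute force on a small graph. The sketch below assumes the AvgWeight definition of density and borrows the thresholds $T_2 = 0.9$, $T_3 = 0.975$, $T_4 = 1$ from the execution example of Section 3.1; the graph itself, the function names, and the inclusive comparison against $T_n$ are hypothetical choices made for illustration.

```python
from itertools import combinations

# Thresholds T_n from the execution example (AvgWeight, N_max = 4).
T = {2: 0.9, 3: 0.975, 4: 1.0}
N_MAX = 4


def avg_weight(weights, subset):
    """AvgWeight density: mean weight over all vertex pairs of the subset."""
    pairs = list(combinations(sorted(subset), 2))
    return sum(weights.get(p, 0.0) for p in pairs) / len(pairs)


def is_dense(weights, subset):
    n = len(subset)
    return 2 <= n <= N_MAX and avg_weight(weights, subset) >= T[n]


def dense_subgraphs(weights, vertices):
    return [c for n in range(2, N_MAX + 1)
            for c in combinations(sorted(vertices), n)
            if is_dense(weights, c)]


def growth_property_holds(weights, vertices):
    """Every dense subgraph on n >= 3 vertices contains a dense (n-1)-subgraph."""
    for c in dense_subgraphs(weights, vertices):
        if len(c) >= 3 and not any(
                is_dense(weights, tuple(v for v in c if v != drop)) for drop in c):
            return False
    return True


# Hypothetical graph: a clique {0, 1, 2, 3} of weight-1.0 edges,
# plus weak edges from every clique vertex to vertex 4.
vertices = range(5)
weights = {p: 1.0 for p in combinations(range(4), 2)}
weights.update({(v, 4): 0.2 for v in range(4)})

print(growth_property_holds(weights, vertices))
print(len(dense_subgraphs(weights, vertices)))
```

Here the dense subgraphs are exactly the subsets of the clique with two to four vertices, so each one can be reached by growing a smaller dense subgraph one vertex at a time.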

Specifically, $T_n$ is a monotonically increasing function of $n$ with
the property $T_n \cdot g_n \ge T_{n-1} \cdot g_{n-1}$. At a high level, this monotonicity
property ensures the desired containment property mentioned
earlier (see Section 4 for details⁴). Moreover, we require
that $T_{N_{max}} = T$.⁵ We discuss the concrete instantiation of $T_n$
used by DynDens in Section 4.2.

⁴ Another way to view dense graphs is the following: Consider the
measure $normDens(C) = \frac{dens(C)}{T_{|C|}}$, consisting of a density measure,
normalized by the threshold function $T_n$; a graph $C$ is dense
iff it has $normDens(C) \ge 1$. While $normDens(C)$ is not a
suitable measure of density per se, it has the following important
growth property: every graph $C$ has a subgraph $C'$ of cardinality
$|C'| = |C| - 1$ with $normDens(C') \ge normDens(C)$. This
containment/growth property additionally implies that, if there are
no dense subgraphs of cardinality $n$, there can be no dense subgraphs
of any cardinality $> n$.

⁵ Recall that $T_n$ is an increasing function of $n$, and the set of maintained
subgraphs needs to include all output-dense subgraphs of
cardinality $\le N_{max}$ having density $\ge T$.

Table 1: Definitions of density-related properties

Subgraph C is ...    iff

Static properties
  dense              $dens(C) \ge T_{|C|}$
  sparse             $dens(C) < T_{|C|}$
  output-dense       $dens(C) \ge T$
  too-dense          $dens(C) > T_{|C|+1}$

Dynamic properties
  stable-dense       $dens(C)^- \ge T_{|C|} \wedge dens(C)^+ \ge T_{|C|}$
  newly-dense        $dens(C)^- < T_{|C|} \wedge dens(C)^+ \ge T_{|C|}$
  losing-dense       $dens(C)^- \ge T_{|C|} \wedge dens(C)^+ < T_{|C|}$

Table 2: Summary of main symbols used

Symbol               Description

$V$                  Set of vertices in graph
$N$                  Number of vertices in graph
$w_{ij}$             Weight of edge between vertices $i$ and $j$
$\vec{\Gamma}_u$     Neighborhood vector of vertex $u$

$dens(C)$            Density of $C$: $dens(C) = \frac{\sum_{i,j \in C,\, i<j} w_{ij}}{S_{|C|}}$

$S_n$                Quantifies relative importance of subgraph cardinality $n$ to density
$g_n$                $g_n = \frac{S_n}{n \cdot (n-1)}$
AvgWeight            Case where $S_n = n(n-1)/2$
SqrtDens             Case where $S_n = \sqrt{n(n-1)}$
AvgDegree            Case where $S_n = n$

$N_{max}$            Max. cardinality of subgraph to be returned
$T$                  Min. density for a subgraph to be returned
$T_n$                Min. density for subgraph of cardinality $n$ to be dense
$\delta_{it}$        Tunable parameter of DynDens, influences $T_n$

$a$, $b$             Vertices that were just updated
$X^-$                Quantity $X$ before the update
$X^+$                Quantity $X$ after the update
$w$                  Weight of edge $(a, b)$ before the update, i.e. $w_{ab}^-$
$w + \delta$         Weight of edge $(a, b)$ after the update, i.e. $w_{ab}^+$

Edge weight updates: The basic operation of DynDens is to
maintain dense subgraphs, following the update of the weight of an
edge $(a, b)$, from $w$ to $w + \delta$. If this impacts the set of output-dense
subgraphs, the latter is updated as well. Handling updates with $\delta < 0$
(i.e. where the weight of an edge decreases) is straightforward:
all dense subgraphs containing both $a$ and $b$ are examined, and their
density is decreased by an appropriate amount. If they are no longer
output-dense, this is reported; if, in addition, they are no longer
dense (losing-dense), they are evicted from the index.

Positive updates: Of greater interest is the case where $\delta > 0$,
i.e. the edge weight update corresponds to an increase in weight.
In this case, additional subgraphs, that were not dense prior to the
update, might now be dense (newly-dense subgraphs). DynDens
leverages the growth property to compute these as follows:
Algorithm 1 Algorithm DynDens

Input: Updated edge $(a, b)$, magnitude of update $\delta$
 1: if $\delta < 0$ then
 2:     Update the density of all dense subgraphs containing $a$ and
        $b$; evict losing-dense subgraphs from the index; report any
        subgraphs that are no longer output-dense
 3:     return
 4: for all dense subgraphs $C$ s.t. $a \in C \vee b \in C$ do {// including
        $C = \{a, b\}$ if it is newly-dense}
 5:     if $a \notin C$ or $b \notin C$ then
 6:         if $C$ should be cheap-explored and $C \cup \{a, b\}$ is newly-dense then
 7:             Add $C \cup \{a, b\}$ to the index, report it if it is output-dense
 8:             explore($C \cup \{a, b\}$, 2)
 9:     else
10:         Update the density of $C$, report it if it just became output-dense
11:         explore($C$, 1)

Cheap explore: DynDens will try to augment all dense subgraphs
containing either $a$ or $b$, with $b$ or $a$, respectively; resulting
newly-dense subgraphs will be inserted into the dense subgraph
index. In some cases, this step alone is sufficient and/or can be applied
only to a subset of these subgraphs (cf. Section 6 for details).

Explore: DynDens will try to augment dense subgraphs containing
both $a$ and $b$, with one neighboring vertex; resulting newly-dense
subgraphs will be inserted into the dense subgraph index.

Exploration iterations: The above procedure may need to be
performed iteratively for newly-dense subgraphs discovered via exploration
or cheap exploration. Interestingly, the iteration depth is
upper bounded by a corollary of the growth property. Specifically,
in Section 4.2, we define $T_n$ parametrized by a parameter $\delta_{it}$ that
indirectly controls the number of dense subgraphs maintained by
DynDens. As we show in Section 4, we can guarantee that at
most $\lceil \delta / \delta_{it} \rceil$ exploration iterations need to be performed,
in order to identify all newly-dense subgraphs, following an edge
weight update of magnitude $\delta$.

Explore all: In a few cases, the above exploration may need
to be performed on non-neighboring nodes as well, resulting in a
very costly procedure. In most cases, DynDens avoids performing
this procedure via a better, implicit representation of some dense
subgraphs in the index (cf. Section 3.2.3).

In one sentence, DynDens explores the neighborhood of some
materialized dense subgraphs, using pruning conditions for when
to stop exploring around a subgraph. The remainder of this section
aims to fill in the blanks in the preceding sentence. We discuss
the workings of DynDens, and illustrate them with a practical
example in Section 3.1, followed by important technical details in
Section 3.2. We defer the exposition of the theoretical results on
which DynDens is based till Section 4.

3.1 The DynDens Algorithm

Let us now discuss DynDens in greater detail, with reference to
Algorithm 1. At a high level, DynDens maintains an in-memory
index of all dense subgraphs (we defer discussing index implementation
details to Section 3.2); at every edge weight update, it outputs
information regarding subgraphs that became, or stopped being,
output-dense. If the edge weight update was negative, only
some index maintenance needs to be done (line 2). Otherwise,
some stable-dense subgraphs containing $a$ and/or $b$ are further examined
(lines 4-11). Note that, to ensure correctness, the subgraph
$\{a, b\}$ may also be examined, even if it was not present in the
index (base case in line 4). Subgraphs in the index containing only
one of $a$, $b$ are cheap-explored, if needed⁶ (line 6).

Algorithm 2 Procedure explore(C, i)

Input: Subgraph $C$, iteration number $i$
 1: if $C$ was not too-dense before the update and $i \le \lceil \delta / \delta_{it} \rceil$ and
        $|C| < N_{max}$ then
 2:     if $C$ is too-dense then
 3:         for all $y \notin C$ do {// Explore-All}
 4:             Add $C \cup \{y\}$ to the index; report it if it is output-dense
 5:             explore($C \cup \{y\}$, $i + 1$)
 6:     else
 7:         for all neighbors $y$ of $C$ do
 8:             if $C \cup \{y\}$ is newly-dense then
 9:                 Add $C \cup \{y\}$ to the index; report it if it is output-dense
10:                 explore($C \cup \{y\}$, $i + 1$)

Subgraphs in the index that contain both $a$ and $b$, as well as
newly-dense subgraphs previously identified, are subsequently explored
(line 11), i.e. DynDens will try to augment them with a
neighboring node (we defer discussing the precise details on how
this is done efficiently to Section 3.2). This will be recursively repeated
on any newly-dense subgraphs discovered, up to $\lceil \delta / \delta_{it} \rceil$ times
(the theoretical results that enable this bounding are discussed in
Section 4). A high-level description of the exploration procedure is
shown in Algorithm 2.
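
The recursion bound is simple to compute, but the ceiling is sensitive to rounding when the update magnitude and $\delta_{it}$ are binary floating-point numbers: a quotient that should be an exact integer may come out slightly above it and be rounded up one step too far. The helper below is a hypothetical sketch (the function name and the choice of exact rational arithmetic are not from the paper) that sidesteps the issue by converting the decimal representations to fractions first.

```python
from fractions import Fraction
from math import ceil


def max_exploration_iterations(delta, delta_it):
    """Upper bound ceil(delta / delta_it) on the exploration depth
    after a positive edge weight update of magnitude delta."""
    if delta <= 0 or delta_it <= 0:
        raise ValueError("expects a positive update and a positive delta_it")
    # Fractions built from the decimal strings keep the division exact.
    return ceil(Fraction(str(delta)) / Fraction(str(delta_it)))


# Values from the execution example: delta = delta_it = 0.15.
print(max_exploration_iterations(0.15, 0.15))
```

With the values of the execution example the bound is 1, so subgraphs discovered in the first exploration iteration are not explored further.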

Algorithm 2 will first ensure that the subgraph should be ex- 
plored. Specifically, the subgraph should not have been too-dense 
before the update (line 1), for otherwise its dense supergraphs would 
have been stable-dense, and hence already identified. Moreover, as 
previously mentioned, DynDens will not explore around any subgraph
more times than necessary. Finally, in a few cases, explored
subgraphs will need to be augmented with every other vertex, not
just neighboring ones (Explore-All; line 2). As the latter is a costly
procedure, in Section 3.2.3 we will present a way to mitigate the
associated cost.

Execution example. To illustrate the workings of DynDens,
let us examine a simple example of its execution. Consider the
sample entity graph of Figure 2(a), and assume an AvgWeight
definition of density (i.e. the density of a subgraph is its average
edge weight), a density threshold of $T = 1$, and a maximum desired
subgraph cardinality of $N_{max} = 4$. Assume that $\delta_{it}$ has been
set to 0.15, so that the thresholds $T_n$ for subgraphs of cardinality $n$
to be considered dense are $T_2 = 0.9$, $T_3 = 0.975$ and $T_4 = T = 1$
(cf. Section 4.2 for details). Thus, the dense subgraphs for this
graph are shown in Figure 2(b) (output-dense subgraphs are emphasized).
Finally, assume that the weight of edge (1, 2) is updated
from 0.8 to 0.95 ($\delta = \delta_{it} = 0.15$). Let us examine how DynDens
will handle this update; to facilitate this discourse, the newly-dense
subgraphs that are inserted into the index are shown in the bottom
half of Figure 2(b).

At a high level, DynDens will examine {1, 2}, as well as all
dense subgraphs containing vertex 1 and/or 2 (Algorithm 1, line 4),
i.e. {1, 3}, {1, 4}, {2, 3}, {2, 4}, {1, 3, 4}, {2, 3, 4}. {1, 2} will
be added to the index (Algorithm 1, line 10), and will be explored
(line 11). Its exploration will entail the addition of newly-dense
subgraphs {1, 2, 3} and {1, 2, 4} to the index (Algorithm 2, line 8);
the former will also be reported as output-dense. Since $\lceil \delta / \delta_{it} \rceil = 1$,
these newly-dense subgraphs will not be further explored (Algorithm 2,
line 10 and line 1). Moreover, during this exploration subgraph
{1, 2, 5} will be examined, but as its density is less than $T_3$,
it will not be added to the index.

⁶ For instance, subgraphs that were too-dense need not be explored,
as, by definition, their dense supergraphs would have been stable-dense,
and hence already identified. Moreover, this step can also
be skipped in other circumstances, cf. Section 6 for details.

Figure 2: Execution example. Panel (b): Dense subgraph index

DynDens will also cheap-explore subgraphs {1, 3}, {1, 4},
{2, 3}, {2, 4} (Algorithm 1, line 6). This will result in subgraphs
{1, 2, 3}, {1, 2, 4} being examined (twice) (Algorithm 1, line 7); as
they are already present in the index, this will not affect anything.
Moreover, DynDens will attempt to explore these subgraphs (Algorithm 1,
line 8); however, since $\lceil \delta / \delta_{it} \rceil = 1$, they will not be explored
(Algorithm 2, line 1).

Finally, DynDens will cheap-explore subgraphs {1, 3, 4} and
{2, 3, 4}. The first cheap exploration will result in newly-dense
subgraph {1, 2, 3, 4} being added to the index, and reported as
output-dense (Algorithm 1, line 7); the second one will revisit this
subgraph, and do nothing. Moreover, in both cases, since
$|\{1, 2, 3, 4\}| = 4 \ge N_{max}$, these subgraphs will not be explored
(Algorithm 2, line 1).

Observation: From the simplified execution example presented 
above, one can observe that DynDens (as currently presented) 
can end up performing redundant computations; e.g. some sub- 
graphs are examined unnecessarily many times. Subsequently, in 
Section 3.2.2 and Section 6, we discuss how to reduce such unnec- 
essary computations. 

3.2 Implementation Considerations 

Having presented DynDens at a high level, let us now see some 
important considerations that arise when implementing it in prac- 
tice. We first introduce the underlying indexing structure used by 
DynDens in Section 3.2.1; this index also enables DynDens to 
avoid redundant computations (Section 3.2.2) as well as the costly 
operation of explore-all (Algorithm 2, line 2; cf. Section 3.2.3).

3.2.1 Index

DynDens requires an efficient index for both the evolving graph
itself, as well as for dense subgraphs. For the graph index, maintaining
node adjacency lists is sufficient (i.e. a mapping
$\forall u \in V: u \to \vec{\Gamma}_u$); this also enables the efficient exploration of a subgraph
(via merging the relevant adjacency lists⁷).

Figure 3: Dense subgraph index

The dense subgraph index is more interesting to examine, as it 
needs to efficiently support several functionalities. To name a few: 
for every dense subgraph, access to its vertices, cardinality and den- 
sity; insertion, update and deletion of dense subgraphs from the in- 
dex; iteration over all dense subgraphs containing vertices a or b, 
where each subgraph must be accessed exactly one time (needed for 
positive edge weight updates); and for a given dense subgraph C, 
and a given vertex $u$, access to subgraph $C \cup \{u\}$, and insertion of
$C \cup \{u\}$ into the index if it is not already present (needed for exploration).
Moreover, as DynDens needs to perform frequent random
accesses on dense subgraphs, the index needs to be in-memory, so 
maintaining a low memory footprint is important. As most dense 
subgraphs will tend to have high overlap, the dense subgraph index 
should minimize the amount of redundant information stored. 

To address these requirements posed by DynDens, we pro- 
pose the following in-memory index. Each subgraph has a unique 
id corresponding to its location in memory; it is also represented 
by its (sorted) set of vertices. DynDens will maintain a pre- 
fix tree of dense subgraphs, illustrated in Figure 3. Each node in 
the prefix tree contains pointers to its children, indexed by ver- 
tex id, a pointer to its parent, as well as information (such as car- 
dinality and density) on the dense subgraph it represents, if ap- 
plicable. Figure 3 shows a view of the index when subgraphs 
{1, 3}, {1, 3, 4}, {1, 3, 5}, {3, 4, 5}, {4, 5} are dense (ignore node 
labeled * for now), along with the density of each subgraph. 

Additionally, to enable effective iteration over dense subgraphs 
containing one or two given vertices, DynDens will also maintain 
inverted lists, i.e. a mapping from vertices to (pointers to) all sub- 
graphs containing a vertex. To decrease the size of inverted lists, the 
inverted list of a vertex u will only contain tree nodes where the lex- 
icographically largest vertex is u. Thus, in order to iterate over all 
subgraphs containing u, DynDens will iterate over all subgraphs 
in its inverted list, and their tree descendants. Furthermore, to fa- 
cilitate inverted list maintenance, inverted lists are implemented as 
linked lists of prefix tree nodes (shown in Figure 3 as dashed arrows).
Inverted lists are updated whenever a new node is created,
or when a leaf node is deleted. Moreover, if the deletion of a leaf
node results in its parent having no children, and representing no 
dense subgraph, the parent will be recursively deleted. 
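
The index just described can be sketched in a few dozen lines. The following is an illustrative approximation rather than the authors' implementation: class and method names are hypothetical, inverted lists are plain Python lists instead of linked lists threaded through the tree nodes (so removal from an inverted list is not constant-time here), and only the density is stored per subgraph.

```python
class Node:
    """Prefix-tree node; the path from the root spells a sorted vertex set."""

    def __init__(self, vertex=None, parent=None):
        self.vertex = vertex    # largest vertex of the set this node stands for
        self.parent = parent
        self.children = {}      # vertex id -> child Node
        self.density = None     # set only if the node represents a dense subgraph


class DenseSubgraphIndex:
    def __init__(self):
        self.root = Node()
        self.inverted = {}      # vertex id -> nodes whose largest vertex is that id

    def insert(self, vertices, density):
        node = self.root
        for v in sorted(vertices):
            if v not in node.children:
                node.children[v] = Node(v, node)
                self.inverted.setdefault(v, []).append(node.children[v])
            node = node.children[v]
        node.density = density
        return node

    def lookup(self, vertices):
        node = self.root
        for v in sorted(vertices):
            node = node.children.get(v)
            if node is None:
                return None
        return node if node.density is not None else None

    @staticmethod
    def vertices_of(node):
        """Recover the vertex set by following parent pointers."""
        out = []
        while node.parent is not None:
            out.append(node.vertex)
            node = node.parent
        return tuple(reversed(out))

    def containing(self, u):
        """Dense subgraphs containing u: inverted-list nodes and their descendants."""
        stack = list(self.inverted.get(u, []))
        while stack:
            node = stack.pop()
            if node.density is not None:
                yield self.vertices_of(node)
            stack.extend(node.children.values())

    def delete(self, vertices):
        node = self.lookup(vertices)
        if node is None:
            return
        node.density = None
        # Recursively drop childless nodes that represent no dense subgraph.
        while node.parent is not None and not node.children and node.density is None:
            del node.parent.children[node.vertex]
            self.inverted[node.vertex].remove(node)
            node = node.parent


# The dense subgraphs of the Figure 3 example, with placeholder densities.
index = DenseSubgraphIndex()
for subgraph in [(1, 3), (1, 3, 4), (1, 3, 5), (3, 4, 5), (4, 5)]:
    index.insert(subgraph, density=1.0)
print(sorted(index.containing(3)))
```

Because each inverted list holds only the nodes whose largest vertex is the given one, the subgraphs containing a vertex are recovered by also walking the descendants of those nodes, which is what `containing` does.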

Our proposed dense subgraph index efficiently addresses the requirements
of DynDens. Specifically, the prefix tree enables DynDens
to reduce its memory footprint, by not storing redundantly
many overlapping dense subgraphs. Moreover, looking up $C \cup \{u\}$
is $O(|C| + 1)$ in all cases (and $O(1)$ if vertex $u$ is lexicographically
greater than any other vertex in $C$); after a look-up, update or insertion
into the index is $O(1)$. Enumerating the vertices in a subgraph
$C$ is $O(|C|)$, via parent pointer traversal. Deleting a subgraph $C$
from the index is $O$(number of leaf nodes deleted); this is typically
$O(1)$ and at worst $O(|C|)$, due to the design of the prefix tree with
embedded inverted lists.

⁷ Specifically, when exploring subgraph $C$, DynDens will compute
$\vec{\Gamma}_C = \sum_{u \in C} \vec{\Gamma}_u$; for every vertex $u \notin C$, the score of $C \cup \{u\}$
can be computed as $score(C \cup \{u\}) = score(C) + \vec{\Gamma}_C \cdot \vec{e}_u$.
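
The adjacency-list merge used during exploration can be sketched as follows: summing the neighborhood vectors of the vertices of $C$ gives, for each outside vertex $u$, the total weight that $u$ would add to the score of $C$. The sparse dictionary representation, the function names, and the sample weights are assumptions made for illustration.

```python
from collections import defaultdict


def merged_neighborhood(adjacency, subset):
    """Gamma_C: for each vertex u, the total weight of edges between u and C."""
    gamma_c = defaultdict(float)
    for v in subset:
        for u, w in adjacency.get(v, {}).items():
            gamma_c[u] += w
    return gamma_c


def candidate_scores(adjacency, subset, subset_score):
    """score(C + {u}) = score(C) + Gamma_C[u], for every neighbor u outside C."""
    members = set(subset)
    gamma_c = merged_neighborhood(adjacency, subset)
    return {u: subset_score + w for u, w in gamma_c.items() if u not in members}


# Hypothetical symmetric adjacency lists: vertex -> {neighbor: weight}.
adjacency = {
    1: {2: 0.95, 3: 1.0},
    2: {1: 0.95, 3: 1.0, 5: 0.1},
    3: {1: 1.0, 2: 1.0},
    5: {2: 0.1},
}
print(candidate_scores(adjacency, (1, 2), subset_score=0.95))
```

Only vertices adjacent to some member of $C$ appear in the merged list, which is why this step suffices for neighbor exploration but not for the explore-all case.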

3.2.2 Avoiding redundant computation 

Besides efficiently providing the requisite functionality for Dyn- 
Dens, our proposed dense subgraph index can also be used (i) to 
ensure that subgraphs that were dense before the update are exam- 
ined exactly once (required for the correctness of DynDens), and 
(ii) to greatly reduce the number of newly-dense subgraphs exam- 
ined more than once (without sacrificing correctness). 

The former (i) can be guaranteed by fixing the order in which 
dense subgraphs are examined. Specifically, if subgraphs contain- 
ing vertices a and/or b need to be examined, and assuming a < b 
(lexicographically), DynDens will traverse the subtrees of all in- 
dex nodes on the inverted list corresponding to b. Subsequently, it 
will traverse the subtrees of index nodes on the inverted list corre- 
sponding to a, stopping the traversal whenever a b node is encoun- 
tered. This procedure is aided by flags that are set on a per-index 
node basis, to help DynDens distinguish newly-dense subgraphs 
in the index. 
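The examination order above can be illustrated with a small sketch. It is a simplified model under assumptions that are not from the paper: every prefix-tree node is treated as a stored subgraph, the per-node flags for newly-dense subgraphs are omitted, and the helper names (`build`, `subtree`, `affected`) are hypothetical.

```python
class Node:
    def __init__(self, vertex, parent=None):
        self.vertex, self.parent, self.children = vertex, parent, {}


def build(subgraphs):
    """Return (root, inverted lists) for a prefix tree over sorted vertex tuples."""
    root, inverted = Node(None), {}
    for sg in subgraphs:
        node = root
        for v in sorted(sg):
            if v not in node.children:
                node.children[v] = Node(v, node)
                inverted.setdefault(v, []).append(node.children[v])
            node = node.children[v]
    return root, inverted


def path(node):
    out = []
    while node.parent is not None:
        out.append(node.vertex)
        node = node.parent
    return tuple(reversed(out))


def subtree(node, stop_at=None):
    """Yield node and its descendants, pruning at any node labelled stop_at."""
    if node.vertex == stop_at:
        return
    yield node
    for child in node.children.values():
        yield from subtree(child, stop_at)


def affected(inverted, a, b):
    """Visit each subgraph containing a and/or b exactly once."""
    a, b = min(a, b), max(a, b)
    for start in inverted.get(b, []):        # first: everything below a b node
        for n in subtree(start):
            yield path(n)
    for start in inverted.get(a, []):        # then: below a nodes, stopping at b
        for n in subtree(start, stop_at=b):
            yield path(n)
```

Because vertices appear in lexicographic order along each path, any subgraph containing both a and b sits below a b node, so pruning the second traversal at b nodes is what prevents double visits.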

For the latter (ii), we leverage the theoretical result that all
newly-dense subgraphs can be identified in at most $\lceil \delta / \delta_{it} \rceil$ exploration
iterations (Section 4). Upon insertion into the index, dense subgraphs
are annotated with the exploration iteration at which they
were identified ($i$ in Algorithm 2); these annotations persist until
the end of Algorithm 1. Algorithm 2 will operate as above for subgraphs
not annotated with an iteration number, or annotated with
an iteration number greater than the current $i$. Otherwise, the subgraph
does not need to be further examined.

3.2.3 Implicit representation of too-dense subgraphs

Having introduced the dense subgraph index used by DynDens,
let us revisit a challenge posed by the presence of too-dense subgraphs,
and show how the index can be leveraged to overcome it.

Recall that a subgraph is too-dense iff, after adding any other 
vertex to it, it is still dense. Thus, when exploring a too-dense 
subgraph, DynDens needs to consider its cartesian product with 
the entire set of vertices $V$, resulting in $|V|$ dense subgraph insertions
into the index (explore-all, Algorithm 2, line 2). This is a
very costly procedure; unsurprisingly, it was experimentally found 
to dominate all other processing costs, in cases where too-dense 
subgraphs existed (cf. Section 5.1). 

To avoid this cost, we propose a modification to the dense subgraph
index, which we term IMPLICITTOODENSE. At a high level,
it entails the implicit representation of supergraphs of too-dense 
subgraphs, so that explore-all will only examine/insert into the in- 
dex a small number of dense subgraphs. 

Specifically, we introduce a fictitious vertex named $*$, which is
lexicographically larger than all other vertices. For every too-dense
subgraph $C$, the index will store a subgraph $C \cup \{*\}$, representing
all $C \cup \{y\}$ where $y$ is a vertex disconnected from $C$; these supergraphs
of $C$ will not be explicitly inserted in the index. Given
this convention, DynDens will handle the explore-all procedure
of a subgraph $C$ that just became too-dense by normally exploring
all neighbors of $C$ (as in Algorithm 2, line 7), and inserting
the subgraph $C \cup \{*\}$ into the index. For instance, revisiting Figure 3,
assume subgraph $\{1,3\}$ is too-dense. Rather than exploring,
and inserting into the index, all its disconnected supergraphs
$\{1,3,6\}, \{1,3,7\}, \ldots, \{1,3,|V|\}$, DynDens has only inserted
a node representing $\{1,3,*\}$.

In the unlikely event that $C \cup \{*\}$ needs to be explored at any time
(corresponding to the exploration of all supergraphs of $C$ augmented
with one disconnected vertex), DynDens will instead try to augment
$C$ with all edges in the graph that are not incident on $C$.

Because every vertex $a$ is potentially contained in $C \cup \{*\}$, whenever
an iteration is performed on the index (Algorithm 1, line 4),
the inverted list corresponding to * needs to be examined as well. 
This inverted list also needs to be maintained during negative edge 
weight updates, if a subgraph stops being too-dense. Finally, note 
that whenever dealing with a subgraph represented by a * index 
entry, DynDens also needs to ensure that the subgraph is not ex- 
plicitly represented elsewhere in the index, which is, however, a 
very efficient operation. 
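The following sketch illustrates the implicit representation. It is a simplification under stated assumptions: the index is modelled as a flat Python set of frozensets rather than the prefix tree, and the names `STAR`, `explore_all` and `is_dense` are hypothetical.

```python
STAR = "*"  # fictitious vertex, assumed to sort after every real vertex


def is_disconnected(adj, y, c):
    """True if y has no edge to any vertex of c."""
    return all(y not in adj.get(v, ()) for v in c)


def explore_all(index, adj, c):
    """Handle a subgraph c that has just become too-dense."""
    c = frozenset(c)
    neighbours = set().union(*(adj.get(v, set()) for v in c)) - c
    for y in neighbours:          # neighbours are still inserted explicitly
        index.add(c | {y})
    index.add(c | {STAR})         # one entry stands for all disconnected supergraphs


def is_dense(index, adj, subgraph):
    """Check explicit entries first, then the implicit * entries."""
    subgraph = frozenset(subgraph)
    if subgraph in index:
        return True
    for y in subgraph:
        c = subgraph - {y}
        if (c | {STAR}) in index and is_disconnected(adj, y, c):
            return True
    return False
```

With `adj = {1: {3}, 3: {1, 5}, 5: {3}}`, calling `explore_all(index, adj, {1, 3})` stores only {1,3,5} and {1,3,*}, yet `is_dense` still answers positively for {1,3,6} or {1,3,7}.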

As we verify experimentally (Section 5.1), the above IMPLICITTOODENSE
modification to the index offers significant performance
benefits to DynDens.



4. THEORETICAL RESULTS 

Having introduced our proposed DynDens algorithm, in this 
section we elaborate on its theoretical underpinnings. We first prove 
its correctness, by deriving a bound on the number of exploration 
iterations that are required, as a function of the magnitude of the 
edge weight update performed (this is the basis of DynDens, cf. 
Algorithm 2, line 1). Specifically, Section 4.1 presents a general 
result, on when a single exploration iteration per stable-dense subgraph
is sufficient. Section 4.2 provides a concrete instantiation for
$T_n$ (recall that $T_n$ determines the relationship between dense and
output-dense subgraphs), based on which the desired bound is then
obtained in Section 4.3. Due to space constraints, detailed proofs, 
and results pertaining to the complexity of DynDens are omitted; 
these can be found in [4]. 

Formalization: The notion of exploration iterations performed
by DynDens has been used throughout its description; before presenting
theoretical results on exploration iterations, let us formalize
this notion.

Let $\mathcal{C}_a = \{C \cup \{b\} \mid C \subseteq V \wedge a \in C \wedge b \notin C \wedge C \text{ is stable-dense}\}$
be the set of graphs consisting of a stable-dense subgraph containing
$a$, augmented with $b$ (similarly, let
$\mathcal{C}_b = \{C \cup \{a\} \mid C \subseteq V \wedge b \in C \wedge a \notin C \wedge C \text{ is stable-dense}\}$).
Let $\mathcal{C}_0 = \mathcal{C}_a \cup \mathcal{C}_b$;
this is the set of all subgraphs that will be examined via
cheap-exploration only.

Let $\mathcal{C}_{ab} = \{C \cup \{y\} \mid C \subseteq V \wedge a, b \in C \wedge C \text{ is stable-dense} \wedge y \text{ is a neighbor of some node in } C\}$
be the set of graphs consisting of
a stable-dense subgraph containing $a$ and $b$, augmented with some
other node; this is the set of all subgraphs that will be examined via
a single exploration iteration.

Let $\mathcal{C}_1 = \mathcal{C}_0 \cup \mathcal{C}_{ab}$; this is the set of graphs containing $a$ and $b$
that consist of a stable-dense subgraph, augmented with one node.

For $i > 1$, let $\mathcal{C}_i = \{C \cup \{y\} \mid C \in \mathcal{C}_{i-1} \wedge C \text{ is newly-dense} \wedge y \text{ is a neighbor of some node in } C\}$.
$\mathcal{C}_i$ is the set of graphs containing
a newly-dense subgraph that contains $a$, $b$, and is discoverable after
$i$ exploration iterations.
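To make these definitions concrete, the sets can be computed directly on a toy graph. The sketch below assumes the stable-dense and newly-dense subgraphs are already known and passed in; the function names `candidate_sets` and `next_level` are hypothetical, and `adj` is assumed to map every vertex to its set of neighbours.

```python
def neighbours_of(adj, c):
    """Vertices outside c that are adjacent to some vertex of c."""
    return set().union(*(adj[v] for v in c)) - c


def candidate_sets(stable_dense, adj, a, b):
    """Return (C_a, C_b, C_0, C_ab, C_1) as sets of frozensets."""
    stable = {frozenset(c) for c in stable_dense}
    c_a = {c | {b} for c in stable if a in c and b not in c}
    c_b = {c | {a} for c in stable if b in c and a not in c}
    c_0 = c_a | c_b                      # reached by cheap-exploration only
    c_ab = set()
    for c in stable:
        if a in c and b in c:            # needs one exploration iteration
            c_ab |= {c | {y} for y in neighbours_of(adj, c)}
    return c_a, c_b, c_0, c_ab, c_0 | c_ab


def next_level(previous_level, newly_dense, adj):
    """C_i for i > 1: grow each newly-dense member of C_{i-1} by one neighbour."""
    level = set()
    for c in previous_level:
        if c in newly_dense:
            level |= {c | {y} for y in neighbours_of(adj, c)}
    return level
```

On the path-like graph `adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}` with stable-dense subgraphs {1,3}, {2,3}, {1,2} and an update on edge (1,2), all of $\mathcal{C}_a$, $\mathcal{C}_b$ and $\mathcal{C}_{ab}$ collapse to the single candidate {1,2,3}.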

4.1 When is a Single Exploration Sufficient?

Let us now provide a sufficient condition for all newly-dense
subgraphs $C$ of cardinality $|C| = n > 3$ to contain a stable-dense
subgraph of cardinality $n - 1$. Specifically, it is sufficient that:

$$\delta < (n - 2)(n - 1)\,(g_n \cdot T_n - g_{n-1} \cdot T_{n-1}) \qquad (1)$$

(recall that $g_n = \frac{S_n}{n \cdot (n - 1)}$, and that the properties of $T_n$ guarantee
that the above bound on $\delta$ is strictly positive).

Proof sketch: (pigeonhole argument) If all $n - 1$ subgraphs
of $C$ were sparse before the update, then the contributions of each
vertex in $C$ to $\mathrm{dens}'(C)$ should be large. Hence, $C$ must be very
dense. However, $C$ was sparse before the update. Thus, the update
must have been very large. If the update is not very large, then there
will exist an $n - 1$ subgraph that was dense before the update.

Corollary: The $n - 1$ subgraph of $C$ that was dense before
the update will either not contain one of $a$ or $b$ (so augmenting
it with that vertex will yield $C$), or it will contain both $a$ and $b$.
Consequently, for values of $n$ where Equation 1 holds, all
newly-dense subgraphs of cardinality $n$ will be contained in
$\mathcal{C}_a \cup \mathcal{C}_b \cup \mathcal{C}_{ab} = \mathcal{C}_1$.

4.2 Instantiating $T_n$

Based on the form of Equation 1, let us now propose a convenient
instantiation for $T_n$, that will satisfy the requisite monotonicity
properties, while greatly simplifying the bounds we subsequently
derive, thus providing additional intuitions. Specifically,
the instantiation of $T_n$ that will be used throughout this work is:

$$T_n = \frac{g_{N_{max}}}{g_n} \cdot T + \frac{\delta_{it}}{g_n} \left( \frac{1}{N_{max} - 1} - \frac{1}{n - 1} \right) \qquad (2)$$

where $\delta_{it}$ is a tunable parameter. Note that this is a reasonable
value for $T_n$ from a maintenance perspective; for instance, if $S_n = n$,
then $T_n = (n - 1) T_2 + (n - 2) \delta_{it} = \frac{n - 1}{N_{max} - 1} (T + \delta_{it}) - \delta_{it} = O(n)$,
while if $S_n = n(n - 1)$, then $T_n = T_2 + \left(1 - \frac{1}{n - 1}\right) \delta_{it} = O(1)$.

Importantly, this instantiation results in a much simplified form
of Equation 1, specifically $\delta < \delta_{it}$. In the following, we will leverage
this fact, to obtain a bound on the number of exploration iterations
that DynDens needs to perform.
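The simplification can be checked numerically. The sketch below assumes the reading $g_n = S_n / (n(n-1))$ and the instantiation $T_n = \big(g_{N_{max}} T + \delta_{it} (\tfrac{1}{N_{max}-1} - \tfrac{1}{n-1})\big) / g_n$, which reproduces the two closed forms quoted above for $S_n = n$ and $S_n = n(n-1)$; the function names are hypothetical, and exact rational arithmetic is used so that equality can be tested without rounding concerns.

```python
from fractions import Fraction


def g(n, s):
    """g_n = S_n / (n (n - 1)), where s(n) returns S_n."""
    return Fraction(s(n), n * (n - 1))


def threshold(n, n_max, t, delta_it, s):
    """T_n under the instantiation assumed in the lead-in."""
    slack = Fraction(1, n_max - 1) - Fraction(1, n - 1)
    return (g(n_max, s) * t + delta_it * slack) / g(n, s)


def single_exploration_bound(n, n_max, t, delta_it, s):
    """Right-hand side of Equation 1 for cardinality n."""
    diff = (g(n, s) * threshold(n, n_max, t, delta_it, s)
            - g(n - 1, s) * threshold(n - 1, n_max, t, delta_it, s))
    return (n - 2) * (n - 1) * diff


if __name__ == "__main__":
    t, delta_it, n_max = Fraction(4, 5), Fraction(1, 20), 6
    for name, s in (("S_n = n", lambda n: n), ("S_n = n(n-1)", lambda n: n * (n - 1))):
        bounds = [single_exploration_bound(n, n_max, t, delta_it, s)
                  for n in range(3, n_max + 1)]
        print(name, bounds)   # every entry equals delta_it
```

The telescoping term $\tfrac{1}{n-2} - \tfrac{1}{n-1}$ cancels the factor $(n-2)(n-1)$ in Equation 1, which is why the bound no longer depends on $n$, $T$ or $S_n$.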

Moreover, for our proposed techniques to be meaningful, it must
be the case that $T_n > 0 \;\; \forall n \in \{2, \ldots, N_{max}\}$. This, along with
the above simplified form of Equation 1, leads to the following validity
range for $\delta_{it}$: $\delta_{it} \in \left(0, \frac{N_{max} - 1}{N_{max} - 2} \cdot g_{N_{max}} \cdot T\right)$. The lower bound
would correspond to maintaining the smallest possible number of
subgraphs, and the upper bound to maintaining most subgraphs
(specifically, all subgraphs of cardinality $N_{max}$, and most subgraphs
of lower cardinalities) - realistically speaking, one should
not set $\delta_{it}$ to any value close to its upper bound.

4.3 Bounding the Number of Iterations 

We are now able to extend Equation 1, to cases where $\delta > \delta_{it}$.

Specifically, we will show that all newly-dense subgraphs of cardinality
$n$ are contained in $\mathcal{C}_0 \cup \mathcal{C}_1 \cup \cdots \cup \mathcal{C}_{\lceil \delta / \delta_{it} \rceil}$; thus, in order
to compute all newly-dense subgraphs, it is sufficient to explore
around stable-dense and newly-dense subgraphs contained in
$\mathcal{C}_0 \cup \mathcal{C}_1 \cup \cdots \cup \mathcal{C}_{\lceil \delta / \delta_{it} \rceil - 1}$.

Proof sketch: An update of magnitude $\delta$ is equivalent to $\lceil \delta / \delta_{it} \rceil$
updates of magnitude up to $\delta_{it}$; furthermore, re-exploring stable-dense
subgraphs will not yield any new dense subgraphs, thus only
newly-dense subgraphs will need to be explored subsequently.

Discussion: As witnessed from the above result, the magnitude
of $\delta$ is directly correlated with the impact on dense subgraphs. A
useful analogy is that of an edge weight update as a perturbation:
the greater its magnitude $\delta$, the further away in the graph its effects
can potentially be felt (i.e. the further away dense subgraphs will
need to be explored).
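As a small worked illustration of the bound, the number of exploration iterations for a positive update of magnitude $\delta$ is $\lceil \delta / \delta_{it} \rceil$. The helper below is a hypothetical sketch; it uses exact fractions because a floating-point division such as 0.3 / 0.1 can land just above or below an integer and change the ceiling.

```python
from fractions import Fraction
from math import ceil


def exploration_iterations(delta, delta_it):
    """Upper bound ceil(delta / delta_it) for one positive edge weight update.

    Arguments may be ints, Fractions, or decimal strings such as "0.25".
    """
    delta, delta_it = Fraction(delta), Fraction(delta_it)
    if delta <= 0 or delta_it <= 0:
        raise ValueError("this sketch only covers positive updates and delta_it > 0")
    return ceil(delta / delta_it)
```

For example, an update of 0.25 with $\delta_{it} = 0.1$ needs at most three iterations, while any update of magnitude at most $\delta_{it}$ needs only one.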



In this context, parameter $\delta_{it}$ offers a tunable space-time tradeoff.
By setting it to higher values, more dense subgraphs will be
maintained, but fewer exploration iterations will be required per
edge update. By setting it to lower values, the space overhead
(i.e. the number of dense subgraphs maintained that are not output-dense)
can be made minimal: nearly 0 for AvgWeight, and comparable
to an offline approach otherwise 8. Consequently, selecting
a good value for $\delta_{it}$ is data-dependent; in practice, we observe
that DynDens performs well for a wide range of $\delta_{it}$ values.

5. EVALUATION 

Let us now discuss the experimental validation of our techniques. 
We will first briefly go over the experimental setup. In Section 5.1 
we will present experimental evidence for the feasibility of real- 
time story identification via ENGAGEMENT, as well as the scala- 
bility of our proposed approach. We will also examine the main 
factors that contribute to the efficiency of DynDens. 

As we have seen throughout this work, there is a lack of existing 
techniques for efficiently addressing ENGAGEMENT. Nevertheless, 
in Section 5.2 we evaluate adaptations of relevant techniques to this 
problem, so as to have a basis for comparison. 

Finally, although efficiency has been our main focus in this work, 
in Section 5.3 we present some qualitative results that highlight the 
effectiveness of our approach. 

Experimental setup: All algorithms evaluated were imple- 
mented in Java, and executed on 64-bit Hotspot VM, on a machine 
with 8 Intel(R) Xeon(R) CPU E5540 cores clocked at 2.53GHz. In 
our experiments, only one core was used, and the memory usage of 
the JVM was capped at 25G of RAM (the actual memory consump- 
tion was typically lower). Finally, in all performance experiments, 
the time reported is the median time of 3 identical runs. 

Datasets: Unless otherwise noted, all our experiments were run
using real-world datasets, based on a sample of all tweets for May
1st, 2011 (our dataset consisted of 13.8M tweets; the sampling
was performed by Twitter itself, as part of the restricted access provided
to its data stream; for details cf. tinyurl.com/twsam).
From these, we removed non-English tweets, and tweets that were
labeled as spam (using an in-house tweet spam filter [24]), resulting
in 3.8M tweets. Subsequently, we used an in-house entity extractor
[3] to identify mentions of real-world entities (such as people,
politicians, products, etc). 76.5% of the tweets did not mention any
entity of interest; 18.3% mentioned one; 4.3% mentioned two, and
under 1% mentioned three or more entities. The entire procedure
took under 1h 20' (under 350 μsec per tweet on average).

Measuring correlation: Given these sets of co-occurring enti- 
ties, there are many ways in which entity association can be mea- 
sured; our techniques are equally applicable, irrespective of the 
measure used. For our evaluation, we selected two measures from 
the literature that we found to yield meaningful results under diverse
circumstances: a combination of the $\chi^2$ measure and the
correlation coefficient inspired by [5] (weighted dataset), that has 
been found to be highly effective in identifying stories in the blogo- 
sphere, as well as a thresholded variant of the log-likelihood ratio 
[26] (unweighted dataset) that has been successfully used to iden- 
tify stories in Grapevine over an extended period of time. In gen- 
eral, we note that any measure that measures strength of pairwise 
association, based on entity occurrences and pairwise co-occurrences 
can equally be used by our techniques. 

Identifying emerging stories: Since the goal of our techniques
is to identify stories in real-time, i.e. "stories happening now", a
mechanism for discounting older stories is required. To achieve
this, we modify our measures of correlation, by applying exponential
decay to all entity occurrences and co-occurrences; for instance,
in our experiments we used a mean life for a tweet of 2 hours.
Note that our techniques are equally applicable without applying
any decay, but the stories identified would then correspond to
"cumulative stories to date" (cf. Table 3 showing stories for the entire
day) as opposed to "current emerging stories" (cf. online demo
www.onthegrapevine.ca/now.jsp).

8 All exact offline approaches, to the best of our knowledge, utilize
some form of a growth property, hence need to compute as many
subgraphs as DynDens with $\delta_{it} \approx 0$.
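The exponential decay of occurrence and co-occurrence counts can be sketched as a lazily decayed counter. This is an illustration under assumptions: "mean life" is read as the time constant of the decay (an event of age t contributes exp(-t / mean_life)), timestamps are in seconds and non-decreasing, and the class name `DecayedCounter` is hypothetical.

```python
import math

MEAN_LIFE_SECONDS = 2 * 60 * 60  # mean tweet life of two hours, as in the experiments


class DecayedCounter:
    """Occurrence count in which each event's weight decays as exp(-age / mean_life)."""

    def __init__(self, mean_life=MEAN_LIFE_SECONDS):
        self.mean_life = mean_life
        self.value = 0.0
        self.last_update = None

    def _decay_to(self, now):
        # Decay is applied lazily, only when the counter is touched.
        if self.last_update is not None:
            self.value *= math.exp(-(now - self.last_update) / self.mean_life)
        self.last_update = now

    def add(self, now, amount=1.0):
        self._decay_to(now)
        self.value += amount

    def read(self, now):
        self._decay_to(now)
        return self.value
```

One such counter per entity and per co-occurring entity pair would supply the decayed counts that the correlation measures consume; lazy decay keeps each update constant-time.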

Approximating complex association measures: Finally, for 
many measures of association (e.g. statistical measures, such as 
the log-likelihood ratio), the appearance of a document with just 
a single entity, can influence the weight of all edges in the graph 
(e.g. the log-likelihood ratio of a pair of entities is a function of the 
number of documents that have appeared to date). This would pose 
a significant challenge to incremental computations; to overcome 
it, we make use of the following approximation, that is applicable 
to any measure: the weight of an edge connecting entities $e_1$, $e_2$ is
computed by ignoring all documents that have appeared after the
latest time that either $e_1$ or $e_2$ appeared in some document.

Intuitively, this will not significantly affect edges connecting popular
entities; indeed we observed that in practice the resulting drop
in precision entailed by this approximation was fairly low 9. Importantly,
this approximation enables us, after observing a document
that mentions entities $e_1, \ldots, e_j$, to update only the weights of
edges that are incident to at least one of these entities,
i.e. only the weights of edges $\{(e_i, x) \mid i \in \{1, \ldots, j\}, x \in V\}$
will be updated.

Taking the above into account, the precise manner in which our 
experimental datasets were created is as follows. 

For every tweet where at least one entity was identified, en- 
tity occurrences and co-occurrences were updated (taking expo- 
nential decay into account, with a mean tweet life of two hours). 
Thereafter, in the case of the weighted dataset, the $\chi^2$ and correlation
coefficient of salient entity pairs were updated; the updated
edge weight was computed as max(correlation coefficient, 0)
if $\chi^2$ showed significant correlation (p < 5%), and 0 otherwise.
This procedure resulted in 952K positive and 40.5M negative edge 
weight updates (recall that the latter are very cheap to process). 
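The weighted edge rule can be sketched as follows. The exact combination used in the experiments is only described as inspired by [5], so this is an assumption-laden illustration: it uses Pearson's $\chi^2$ statistic and the phi correlation coefficient of the 2x2 co-occurrence table, with the standard critical value 3.841 for one degree of freedom at p = 0.05. The function name `edge_weight` and its parameters are hypothetical.

```python
import math

CHI2_CRITICAL_5PCT = 3.841  # chi-squared critical value, 1 degree of freedom, p = 0.05


def edge_weight(n_xy, n_x, n_y, n_total):
    """Weight of edge (x, y) from (possibly decayed) occurrence counts.

    n_xy: documents mentioning both entities; n_x, n_y: documents mentioning
    each entity; n_total: all documents considered.
    """
    a = n_xy                          # both entities
    b = n_x - n_xy                    # x only
    c = n_y - n_xy                    # y only
    d = n_total - n_x - n_y + n_xy    # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom <= 0:
        return 0.0
    phi = (a * d - b * c) / math.sqrt(denom)   # correlation coefficient of the table
    chi2 = n_total * phi * phi                 # Pearson chi-squared for a 2x2 table
    return max(phi, 0.0) if chi2 >= CHI2_CRITICAL_5PCT else 0.0
```

Clamping at zero means that significantly anti-correlated entity pairs receive weight 0, just like pairs whose correlation is not significant.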

In the case of the unweighted dataset, the log-likelihood ratio 
of salient entity pairs was updated. Two entities were connected 
with an edge iff each entity appeared in at least 5 tweets, and log- 
likelihood showed significant correlation (p < 1%). This proce- 
dure resulted in 43K positive edge weight updates (edge additions), 
and 41K negative ones (edge removals). 
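The unweighted edge rule can be sketched in the same style. The sketch assumes the log-likelihood ratio is the usual G-squared statistic of the 2x2 co-occurrence table, compared against the standard critical value 6.635 (one degree of freedom, p = 0.01); it does not check the direction of the association. The names `log_likelihood_ratio` and `has_edge` are hypothetical.

```python
import math

LLR_CRITICAL_1PCT = 6.635  # chi-squared critical value, 1 degree of freedom, p = 0.01
MIN_TWEETS = 5             # each entity must appear in at least this many tweets


def log_likelihood_ratio(n_xy, n_x, n_y, n_total):
    """G^2 statistic of the 2x2 co-occurrence table (0 * log 0 is taken as 0)."""
    observed = [n_xy, n_x - n_xy, n_y - n_xy, n_total - n_x - n_y + n_xy]
    rows = [n_x, n_x, n_total - n_x, n_total - n_x]
    cols = [n_y, n_total - n_y, n_y, n_total - n_y]
    g2 = 0.0
    for o, r, c in zip(observed, rows, cols):
        expected = r * c / n_total
        if o > 0:
            g2 += o * math.log(o / expected)
    return 2.0 * g2


def has_edge(n_xy, n_x, n_y, n_total):
    """Thresholded variant: an edge exists iff both support and significance hold."""
    return (n_x >= MIN_TWEETS and n_y >= MIN_TWEETS
            and log_likelihood_ratio(n_xy, n_x, n_y, n_total) >= LLR_CRITICAL_1PCT)
```

Under this reading, a positive edge weight update corresponds to `has_edge` flipping from False to True for a pair, and a negative one to the reverse flip.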

In either case, this step took under 90 seconds for the entire day. 
The streams of edge weight updates were loaded to memory be- 
fore initiating our experiments, and the updates were provided to 
DynDens sequentially, and in-memory. This reflects the expected 
usage of DynDens, as the edge weight updates that constitute its 
input will typically be generated by another process in real-time. 
All times reported correspond to the time required to process all 
edge weight updates resulting from a dataset, while maintaining 
output-dense subgraphs after each update. Specifically, they do not
include the time required to preprocess the dataset (e.g. entity extraction,
correlation computation), nor do they include the fixed
initialization costs of DynDens (such as JVM initialization and
initialization of necessary indexing structures). It is worth noting,
however, that the throughput of DynDens can more than match
the stream rate, even after factoring in all preprocessing steps (in
total, the overhead for all preprocessing and execution of DynDens
for our dataset of one day was generally under 90 minutes;
moreover the most costly preprocessing steps, i.e. named entity
extraction, are inherently parallelizable).

5.1 Efficiency and Scalability 

Let us now examine some of our experimental findings. Fig- 
ures 4(a)-4(d) show the time required to process all updates from 
either dataset, for a variety of definitions of density (experiments 
involving additional density functions can be found in [4]), and for 
a wide range of values of density threshold $T$ and maximum dense
subgraph cardinality $N_{max}$. In these figures, $\delta_{it}$ has been set to
1% of its maximum value, given the values of the other parameters
(thus the number of maintained dense subgraphs is typically close
to the number of output-dense subgraphs). All runs were capped
at 10 minutes (runs that took longer than that were terminated); all
figures are cropped to exclude such time-outs 10.

We observe that DynDens is able to very efficiently process
large datasets, across a wide range of useful operating parameters,
validating its applicability for efficiently addressing ENGAGEMENT.
The chosen parameters range from instances with none,
or only a few output-dense subgraphs, to instances with too many
output-dense subgraphs (in the thousands); i.e. the extremal parameter
values correspond to instances of less practical interest. Interestingly,
one can observe a sharp improvement in performance beyond
certain values of parameters $T$ and $N_{max}$. This is due to the ensuing
sharp drop in the average number of output-dense subgraphs.
For instance, with reference to Figure 4(c), the average 11 number
of output-dense subgraphs of cardinality at most 6, for T = 1 is
3.4K; for T = 0.8 it is 13.4K; while for T = 0.7 it is over 52K.
Similar trends can be observed in the other figures as well; cf. [4].

Having discussed the scalability and efficiency of DynDens, let
us now turn to evaluating its inner workings. Firstly, let us examine
the effects of the $\delta_{it}$ parameter. Recall that low values of $\delta_{it}$ correspond
to DynDens materializing fewer dense subgraphs, and,
correspondingly, having to perform potentially more explorations.
In our experiments, we found our techniques to perform equally
well for a wide range of values of $\delta_{it}$; however, selecting a value
for it based on characteristics of the dataset can be beneficial to
performance. In Figure 4(e), we show the time taken by DynDens
to process the unweighted dataset (note the semilog scale),
for $N_{max} = 10$ and AvgWeight, across all possible values for
$\delta_{it}$ (shown normalized to its maximum value for each threshold).
We observe an interesting local optimum wrt. $\delta_{it}$, arising from the
tradeoff of having to materialize more subgraphs, while enabling
faster updates; i.e. increasing $\delta_{it}$ improves performance, up to a
point where the additional dense subgraphs that need to be maintained
make this a performance drain. For instance, this point is
around 0.2 for T = 0.8, around 0.1 for T = 0.9, and around 0.6
for T = 1. It is also interesting to note that this tradeoff comes into
play again for T = 1 and high $\delta_{it}$.

9 Specifically, we measured the error entailed by this approximation,
i.e. the absolute difference of the approximated value of each
edge weight, minus the actual value of the correlation measure, for
all edges, at 100 uniformly distributed time instants. The median
error over all edges was invariably 0; the average absolute error
over all edges and all time instants was 0.0003 for the weighted
dataset, and 0.002 for the unweighted one, and the average relative
error was 10% and 6% respectively.

10 The only data points that had terminated runs are outside the displayed
range; these instances had too large a number of output-dense
subgraphs, as a result of unrealistic values for $T$, $N_{max}$
and/or $\delta_{it}$, and were not expected to finish in a reasonable time.

11 Averaged over all updates, and excluding output-dense subgraphs
that are not represented in the index, e.g. most too-dense subgraphs,
augmented with a non-neighboring node (cf. Section 3.2.3).

Figure 4: Experimental evaluation

As we previously saw in Section 3.2.3, IMPLICITTOODENSE
is crucially important for DynDens to operate efficiently, in the
presence of too-dense subgraphs. We validated this intuition experimentally,
by executing a variant of DynDens that did not make
use of IMPLICITTOODENSE, on the weighted dataset, and comparing
its runtime to that of DynDens. We experimented with execution
parameters $N_{max} \in \{9, 10\}$, $T \in [0.44, 0.5]$ and with $\delta_{it}$
between 1% and 50% of its maximum value, given the values of the
other parameters. Invariably, the variant without IMPLICITTOODENSE
took longer than 20 minutes to complete (and was killed after
20 minutes, in the interests of brevity), while DynDens took 40-85
seconds to complete.

5.2 Comparison with Other Techniques 

As we have already discussed throughout this work, to the best of 
our knowledge, prior to DynDens, no techniques have been pro- 
posed for efficiently addressing ENGAGEMENT in its general form. 
Thus, in order to have a basis for comparison, in this section we 
evaluate adaptations of relevant techniques to subsets of ENGAGEMENT,
namely the dynamic maximal clique algorithm proposed
in [27] (STIX), the Greedy Randomized Adaptive Search Procedure
used to identify large quasi-cliques in [1] (GRASP), as well
as a baseline efficient offline procedure that periodically recom- 
putes all AvgWeight dense subgraphs (BASELINE). We wish to 
stress that, by their very nature, these comparisons are not fair, as the
goals of the aforementioned techniques are entirely different from 
those of ENGAGEMENT, while said techniques are not as general 
as DynDens. 

Let us review each comparison in detail. The STIX algorithm 
[27] identifies all maximal cliques in dynamic unweighted graphs. 
This is similar to ENGAGEMENT for T = 1, AvgWeight and 
unweighted graphs, but subtly different, in that ENGAGEMENT re- 
quires the identification of all cliques. Recall that the output of 
ENGAGEMENT will be used to present stories to a human user, thus 
the subgraphs produced cannot be too large. If STIX were used to
address ENGAGEMENT, and a maximal clique of cardinality e.g.
20 were identified, all its subgraphs of cardinality e.g. 5 or less 
would need to be enumerated, and provided as output. 

Keeping in mind the caveats above, we implemented STIX using
an efficient in-memory hash-based index 12, and executed it on
the unweighted dataset, measuring its execution time, and ignoring
the time that would be needed for enumerating all subgraphs of
maximal cliques. We compared this runtime to DynDens with
AvgWeight, T = 1 (so as to have a basis for comparison),
$N_{max} = 5$ 13, and set $\delta_{it}$ to half its maximum value, given the
values of the other parameters.

Even though a comparison of STIX and DynDens is entirely
artificial, the runtimes of STIX and DynDens were roughly equal:
STIX took 958 seconds to process the dataset, compared to 936
sec for DynDens. DynDens performed even better for lower
$N_{max}$, and took more time for higher $N_{max}$. Thus, we conclude
that DynDens is best suited to applications of ENGAGEMENT, 
while STIX is preferable for applications that require identifying 
maximal cliques in unweighted subgraphs. 

Let us now review the comparison to GRASP, proposed in [1].
This is an approximate randomized algorithm for identifying large
dense subgraphs in unweighted graphs. While [1] has significantly
more general contributions, for the purposes of this discussion,
the algorithm proposed therein can be used to identify subgraphs
with density over a given threshold $T$, under AvgWeight, in unweighted
graphs. GRASP will not necessarily identify all dense
subgraphs, but can be executed multiple times per update, to identify
an increasingly larger number of such subgraphs. It is important
to note that, again, the comparison with DynDens is not
straightforward, as GRASP is geared towards identifying a few large
dense subgraphs, as opposed to all dense subgraphs.

12 [27] does not provide indexing details, so we opted for an efficient
solution, albeit with high memory consumption. We also experimented
with an adaptation of STIX that used our proposed index,
which has much lower memory requirements, but this invariably
resulted in increased runtime for STIX.

13 Since the goal is story identification, we set $N_{max}$ to a low value,
corresponding to story cardinalities suitable for humans.

Table 3: Top stories, May 1st 2011

Pres. Obama announces killing of Osama bin Laden involving:
Barack Obama, U.S. House Permanent Select Committee on
Intelligence, Osama bin Laden, NBC News

Commentary on death of bin Laden, comparison to famous
athletes involving: Barack Obama, LeBron James, Delonte
West, Osama bin Laden

Discussions on Lady Gaga's activities involving: Lady Gaga, Galeria

Libya crisis: NATO Airstrike results in death of 3 grandchildren of
Gaddafi involving: NATO, Libya

Discussions on Harry Potter involving: Hermione Granger, Draco
Malfoy, Bella Swan

News on Osama Bin Laden's Death Spreads On Twitter
involving 15: Clint Eastwood, Barack Obama, U.S. House Permanent
Select Committee on Intelligence, Osama bin Laden, CBS News

Nevertheless, we implemented GRASP, using an efficient hash-based
in-memory index 16. We set the parameter $\alpha$ that controls
its greediness vs. randomness tradeoff to 0.5, after ensuring this
did not result in any significant performance differences 17. We executed
GRASP on the unweighted dataset, for a varying number of
iterations per edge weight update (more iterations mean higher runtime,
and a higher likelihood of identifying more dense subgraphs),
and measured its runtime, and recall (fraction of output-dense subgraphs
that it identified, excluding disconnected subgraphs, which
it does not produce). We limited GRASP to searching for subgraphs
of cardinalities up to $N_{max} = 5$, and normalized the runtime of
GRASP to the runtime of DynDens for the same parameters 18 (i.e.
the normalized runtime of DynDens is 1). The normalized runtime
of GRASP is reported in Figure 4(g), and its recall in Figure 4(f).
As we can see, GRASP offers a runtime/recall tradeoff,
and can thus be at times more efficient than DynDens (however, in 
such cases, it offers recall of under 80%). Moreover, GRASP offers 
diminishing returns wrt. recall (i.e. it takes increasingly many iter- 
ations to achieve arbitrarily high recall; even though the increase in 
runtime is linear wrt. the number of iterations, the increase in recall 
is decidedly sublinear). Thus, in this context, GRASP is best suited 
to identifying a sample of all dense subgraphs. However, since high 
recall is of crucial importance in story identification (missing 20% 
of important stories would not generally be acceptable), DynDens 
is best suited to addressing ENGAGEMENT in this setting. 

Finally, we also investigated a simple baseline approach (BASELINE),
which periodically recomputes all output-dense subgraphs
wrt. AvgWeight. The aim of this comparison was to validate
the necessity for incremental computation as opposed to periodic
offline recomputation. We implemented BASELINE using an efficient
hash-based in-memory index, and executed it on our experimental
datasets with varying parameters ($T$, $N_{max}$), and at varying
uniform sampling intervals (i.e. every X tweets). We measured
the number of recomputations that BASELINE was able to perform,
given the same time as DynDens took for the entire dataset.

14 A Cleveland blogger compared Osama bin Laden to athlete LeBron
James; the discussion continued on Twitter, resulting in a
sports-related meme around the death of bin Laden.

15 C. Eastwood was mentioned in conjunction with this story as part
of a humorous meme started by comedian Steve Martin on Twitter.

16 The index used in [1] is optimized for secondary storage, hence
not very useful for the purposes of our comparison.

17 The average (over the values of all other parameters tested) standard
deviation of varying $\alpha \in (0, 1)$ was 4%, and the median standard
deviation was 1%.

18 For DynDens we selected a reasonable value of $\delta_{it}$, given the
values of the rest of the parameters.

Even given the above restricted problem setting, we observed 
that BASELINE was generally not up to the task of realtime story 
identification. In our weighted dataset, and for a wide range of 
parameters, it was able to perform up to 15-30 recomputations in 
the same time that DynDens processed the entire dataset (corre- 
sponding to identifying new stories every 48-96 minutes"). In the 
unweighted dataset (which had on average fewer edges, and was 
thus more amenable to reprocessing from scratch), BASELINE did 
somewhat better, performing 135-300 recomputations for the pa- 
rameters we experimented with (corresponding to identifying new 
stories about every 5-10 minutes). More detailed results can be 
found in [4]. We conclude that, although periodic recomputation 
may be an option in limited scenarios (e.g. unweighted graphs, 
AvgWeight, not very strict realtime requirements), in general the 
performance benefits of incremental recomputation are needed to 
support realtime story identification. 

5.3 Qualitative Results 

Whereas the focus of this work is to efficiently identify dense 
subgraphs in an incremental manner, we also provide evidence of 
the effectiveness of our approach. Evaluating the quality of our re- 
sults for realtime story identification is both inherently challenging, 
due to the lack of a ground truth for what constitutes an important 
story for a given medium (e.g. a micro-blogging site vs. a news 
agency), as well as beyond the scope of this work. We will thus 
present some sample results of utilizing dense subgraphs for story 
identification. We have also built a live demo for our techniques, 
which we will briefly discuss, and encourage interested readers to 
visit so as to view this work in action. 

In order to present sample results, we chose to focus on stories 
at the granularity of a single day (since presenting stories that were 
heavily discussed at a specific date and time would be hard to pro- 
cess out of context). We used a dataset similar to the "unweighted" 
one from our performance experiments, with the following two 
modifications: entity correlations were computed over the entire 
dataset, as opposed to using exponential decay; and edge weights 
were retained for pairs of entities with log likelihood of over 5% 
significance, rather than being thresholded and restricted to {0, 1}. 
We computed dense subgraphs of cardinality up to N_max = 5, using
AvgDegree to quantify density, so as to favor larger dense
subgraphs; for presentation purposes these were subsequently re-ranked
in a diversity-aware manner [2] (subgraph overlap was penalized
by multiplying subgraph density by 1 − 0.8 · (fraction of
story entities covered by previous stories)).
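The overlap penalty can be illustrated with a small sketch. The penalty factor comes from the text; the greedy selection loop, the function name and the toy stories are assumptions made for illustration, and the actual procedure of [2] may differ.

```python
from typing import FrozenSet, List, Set, Tuple

Story = Tuple[FrozenSet[str], float]  # (entities in the subgraph, density)


def rerank_diverse(stories: List[Story], penalty: float = 0.8) -> List[Story]:
    """Greedily order stories by density * (1 - penalty * covered fraction),
    where the covered fraction is the share of a story's entities that
    already appear in previously selected stories."""
    remaining = list(stories)
    covered: Set[str] = set()
    ranked: List[Story] = []
    while remaining:
        def adjusted(story: Story) -> float:
            entities, density = story
            overlap = len(entities & covered) / len(entities)
            return density * (1.0 - penalty * overlap)

        best = max(remaining, key=adjusted)
        remaining.remove(best)
        ranked.append(best)
        covered |= best[0]
    return ranked


# Toy input with made-up entities and densities.
stories = [
    (frozenset({"a", "b"}), 1.0),
    (frozenset({"a", "b", "c"}), 0.9),  # mostly repeats the first story
    (frozenset({"x", "y"}), 0.7),       # unrelated story
]
for entities, density in rerank_diverse(stories):
    print(sorted(entities), density)
```

In this toy input the second story has the second-highest raw density, but two of its three entities are already covered, so the unrelated story is ranked ahead of it.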

Table 3 presents the resulting top stories. We observe that dis- 
cussions on bin Laden's death feature prominently in the list; more- 
over, given the typical conversation tone on Twitter, distinct discus- 
sions involved comparing the presidential announcement to famous 
athletes 14 , and even the rapid propagation of the news on Twitter. 
Other stories cover the evolving crisis in Libya, as well as lighter, 
ongoing issues, such as Harry Potter, and Lady Gaga's antics. 

For comparative purposes, we also performed the same proce- 
dure on a dataset consisting of all blog posts made on major blog 
hosting platforms during the same day; due to space constraints the 
results can be found in [4]. 

Finally, to validate the effectiveness of our approach, we have built
a live demo of our techniques, in the context of Grapevine [3]. This
prototype processes millions of blog posts on a daily basis, and
computes important stories in real-time. It consists of a pipeline
that processes blog posts as they are crawled, rejecting spam and
non-English language posts, extracts named entity mentions, updates
the entity graph, and uses DynDens to update the set of current
dense subgraphs, as in the "unweighted" dataset used in our
experiments. It also keeps track of output-dense subgraphs,
which are reported to the user upon request. Besides the entities
involved in each output-dense subgraph/story, a few links to relevant
blog posts are provided, as well as a link back to Grapevine
for further exploration of the historical evolution of the story.
Interested readers are encouraged to explore this prototype, available
at www.onthegrapevine.ca/now.jsp.

19 As our dataset corresponds to tweets made in one day.

6. HEURISTICS 

In concluding our exposition of DynDens, let us also exam- 
ine two additional heuristics that can offer modest performance im- 
provements, without affecting the quality of results. Both are re- 
lated to limiting the number of explorations, and cheap explorations 
performed. Due to space constraints, the full details for these, and 
proofs of their correctness, are omitted, and can be found in [4]. 

MaxExplore: Whereas it serves to prove the correctness of 
DynDens, the previous bound on exploration iterations that need 
to be performed on a subgraph C is overly pessimistic, as it is based 
on several worst-case assumptions. To overcome this challenge, 
we developed MaxExplore, an improvement over the previous 
bound, that takes the graph neighborhood of the updated edge, as 
well as the cardinality of the subgraph being explored, into account. 
As it is a fairly cheap bound to compute, we can expect MaxEx- 
plore to lead to performance improvements in the case of dense 
subgraphs on which multiple exploration iterations would have oth- 
erwise been performed. 

DEGREEPRIORITIZE: Another challenge in the basic form of
DynDens discussed so far is that a single graph might be explored
multiple times, by exploration procedures originating from
each of its dense subgraphs. To mitigate the adverse effects this can 
have on performance, we developed DEGREEPRIORITIZE, a way to 
organize the search space, and thus often avoid performing redun- 
dant explorations, inspired by the degree-based criterion proposed 
in [28]. At a high level, it guarantees that DynDens does not need 
to explore (or cheap-explore) a subgraph with vertices having dense 
connections to the subgraph. We thus expect DEGREEPRIORITIZE 
to offer the greatest benefit to performance in cases of dense sub- 
graphs on which redundant, multiple-iteration explorations would 
have otherwise been performed. 

Evaluation: In our evaluation of DynDens, the above heuris- 
tics were enabled. Thus, to evaluate their performance benefits, we 
also evaluated variants of DynDens where DEGREEPRIORITIZE
and/or MaxExplore were disabled, on both our weighted
and unweighted datasets. We observed that these heuristics were 
responsible for very modest performance improvements of up to 
4%, and sometimes even resulted in worse performance. 

By design, we expect the proposed heuristics to offer perfor- 
mance benefits in cases where many explorations would have other- 
wise been performed in their absence. To validate this, and further 
investigate their potential to improve performance, we evaluated 
them on a synthetic dataset that consisted of near-cliques, mixed 
with random edges, that was generated as follows: In an initially 
empty graph with 100K vertices, 250K updates were generated, 
each of magnitude in (0, 0.1] (with probability 0.3 the update was
negative). With probability 0.9, the update occurred within one
of 100 predefined sets of 10 vertices each; otherwise, it was uniformly
randomly distributed to the remainder of the graph. Finally,
in order to evaluate the proposed heuristics in the absence of too-dense
subgraphs, updates that would result in too-dense subgraphs
for T = 0.7 and δ_it at 40% of its maximum value were rejected.
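A minimal sketch of such an update generator is given below. Several details are assumptions not stated in the text: magnitudes are drawn uniformly, the predefined sets are disjoint and occupy the lowest vertex ids, and "the remainder of the graph" is read as the vertices outside those sets. The rejection of updates that would create too-dense subgraphs is omitted, since it depends on the thresholds maintained by DynDens.

```python
import random
from typing import Iterator, Tuple


def synthetic_updates(
    num_vertices: int = 100_000,
    num_updates: int = 250_000,
    num_sets: int = 100,
    set_size: int = 10,
    p_in_set: float = 0.9,
    p_negative: float = 0.3,
    max_magnitude: float = 0.1,
    seed: int = 0,
) -> Iterator[Tuple[int, int, float]]:
    """Yield (u, v, delta) edge weight updates for a near-clique workload."""
    rng = random.Random(seed)
    # Assumption: the planted sets are disjoint blocks of consecutive ids.
    planted = [
        list(range(i * set_size, (i + 1) * set_size)) for i in range(num_sets)
    ]
    first_free = num_sets * set_size
    for _ in range(num_updates):
        if rng.random() < p_in_set:
            # Both endpoints fall inside one planted near-clique.
            u, v = rng.sample(rng.choice(planted), 2)
        else:
            # Background noise among the remaining vertices.
            u, v = rng.sample(range(first_free, num_vertices), 2)
        # random() lies in [0, 1), so 1 - random() lies in (0, 1].
        magnitude = (1.0 - rng.random()) * max_magnitude
        sign = -1.0 if rng.random() < p_negative else 1.0
        yield u, v, sign * magnitude
```

Seeding the generator keeps runs reproducible, which is convenient when comparing the four heuristic variants on an identical update stream.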

Figure 4(h) shows the time taken by each DynDens variant (no
heuristics enabled, only DEGREEPRIORITIZE enabled, only MaxExplore
enabled, both heuristics enabled), normalized by the time
taken by the first variant; the operating parameters were T = 0.7,
N_max ∈ {8, 9, 10}, and δ_it at 40% of its maximum value (note
that the Y axis does not start at 0). The proposed heuristics are
seen to offer performance improvements of over 10% in the best case; thus,
while not as crucial as IMPLICITTOODENSE to performance, we
believe that the low effort required to implement these heuristics
makes them worthwhile for inclusion in DynDens.

7. RELATED WORK 

While we are not aware of any work that addresses the main- 
tenance of dense subgraphs in weighted graphs, under streaming 
edge weight updates, for a broad definition of density, there exists 
a rich literature of works dealing with related problems. 

[27] addresses incremental maximal clique maintenance, from a 
mostly theoretical perspective, and using a growth property. This 
is very closely related to a special case of ENGAGEMENT (namely, 
for unweighted graphs, AvgWeight, and T = 1). An important
difference is that our instantiation of ENGAGEMENT deals with all 
cliques, with cardinality constraints, as opposed to maximal cliques 
of unconstrained cardinality. As discussed in Section 5.2, while the 
former is better suited to real-time story identification, the latter 
may be preferable in other scenarios. 
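To make the special case concrete, the brute-force sketch below assumes that AvgWeight is the mean edge weight over all vertex pairs of a subgraph. On an unweighted graph with T = 1, a subgraph then qualifies exactly when it is a clique. The function names and the toy graph are illustrative, and the exhaustive enumeration only serves to contrast "all cliques up to a cardinality bound" with "maximal cliques"; it is not how DynDens operates.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple

Weights = Dict[FrozenSet[str], float]


def avg_weight(vertices: Iterable[str], weights: Weights) -> float:
    """Mean edge weight over all vertex pairs (missing edges count as 0).
    Expects at least two vertices."""
    pairs = list(combinations(sorted(vertices), 2))
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def dense_subgraphs(
    vertices: Iterable[str], weights: Weights, threshold: float, n_max: int
) -> List[Tuple[str, ...]]:
    """Exhaustively list every subgraph with 2..n_max vertices whose
    AvgWeight density reaches the threshold."""
    vs = sorted(vertices)
    return [
        subset
        for size in range(2, n_max + 1)
        for subset in combinations(vs, size)
        if avg_weight(subset, weights) >= threshold
    ]


# Unweighted toy graph: a triangle a-b-c plus a pendant edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
weights = {frozenset(e): 1.0 for e in edges}
print(dense_subgraphs("abcd", weights, threshold=1.0, n_max=3))
```

The toy graph has only two maximal cliques, {a, b, c} and {c, d}, yet five cliques satisfy the cardinality bound, since every edge of the triangle is reported alongside the triangle itself.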

[28] addresses near-clique identification, in an offline setting, 
again from a mostly theoretical perspective, and using a growth 
property; this corresponds to the offline version of ENGAGEMENT 
for unweighted graphs, and AvgWeight. The techniques proposed
therein cannot be efficiently dynamized in a straightforward
fashion, as the information they rely upon cannot be efficiently 
maintained across updates. Our DEGREEPRIORITIZE pruning con- 
dition is inspired by the parent degree-based criterion proposed in 
this work. [23] addresses the same problem, using a similar growth 
property, and with a focus on a parallel implementation. As with 
the other works, the techniques developed therein are not straight- 
forward to efficiently dynamize. 

Max (quasi-) clique: Related problems occur in the maximum 
clique [25] and quasi-clique literature. To overcome the intractabil- 
ity and inapproximability of this problem, heuristics (typically ran- 
domized) have been used to discover large (quasi-) cliques. A cru- 
cial difference is that ENGAGEMENT requires the enumeration of 
all dense subgraphs (as from an application perspective, each sub- 
graph corresponds to a story of interest). In contrast, works in the 
maximum (quasi-) clique domain are geared towards identifying 
one "good" subgraph per execution iteration. Moreover, most such 
heuristic techniques are not straightforward to efficiently dynamize. 

Perhaps most closely related is the state-of-the-art Greedy Ran- 
domized Adaptive Search Procedure used in [1] to identify large 
dense subgraphs (quasi-cliques). Although this work is more fo- 
cused towards developing techniques for limited main-memory sce- 
narios, their techniques can be dynamized in an efficient manner to 
address ENGAGEMENT for unweighted graphs and AvgWeight 
(cf. Section 5.2). 

Local density: Other works have dealt with edge-weight update 
semantics, albeit with much simpler definitions of density. For in- 
stance [30] and others maintain dense subgraphs over sliding win- 
dows using neighbor-based patterns (i.e. whether a dense subgraph 
should be augmented with an additional node is decided based on 
local information only). As the problem being addressed therein 
is very different from ENGAGEMENT, the proposed techniques are 
inapplicable in the latter domain. 

Max-flow: [12], [20] and others use (primarily) max-flow based
algorithms to identify dense subgraphs. While max-flow algorithms
can be dynamized [22], [18], these algorithms can only identify and
maintain clusters containing user-specified nodes. In a related vein,
[14] uses max-flow to find the top-1 dense subgraph (for AvgDegree);
however, their techniques cannot be efficiently applied to a
top-k or threshold variant, nor can they be efficiently dynamized.

Dynamic graphs: Other works (e.g. [10], [6]) have dealt with 
dynamic graph algorithms under edge weight updates, but do not 
deal with density problems, focusing instead on properties such as 
planarity, connectivity, triangle counting, etc. A notable exception 
is [17], which discusses approximation algorithms to general maxi- 
mization problems in dynamic graphs. It is, however, theoretical in 
nature, and its focus is on the approximation ratio of the resulting 
algorithm, not on efficiency. 

Clustering: Related problems are also dealt with in the incre- 
mental clustering literature (e.g. [11], [15], [8]); however, these 
deal with graph node insertion and deletion, and the proposed tech- 
niques cannot directly accommodate streaming edge weight up- 
dates. A tangentially related problem is evolutionary clustering 
([7], [21]) which identifies clusters based on both density, and his- 
torical data; the goal is to introduce temporal smoothing, so that 
clusters behave in a stable fashion over time. 

Communities of interest: [9], and its extension [19], address 
the problem of supporting efficient retrieval of important 2-neighbors 
of any node, where the importance of a neighbor is related to local 
and global edge thresholds. The focus is on better representation 
of actual interactions, and removal of spurious information, and 
the provided insights are invaluable for any applications that involve
dynamic graphs. However, the problem examined in these
works is substantially different from ENGAGEMENT, hence the techniques
proposed in these works do not apply to ENGAGEMENT.

Shingling: [13] proposes techniques to identify large dense sub- 
graphs in an offline fashion via recursive shingling. While this 
could potentially be dynamized, it is geared towards large sub- 
graphs (100-10K nodes), and would not be effective on smaller 
subgraphs. [29] also uses LSH to identify cliques of moderate size 
in large graphs; it is however not easily amenable to dynamization, 
as it has a significant preprocessing phase. 

Data structures: Finally, the index structure used by DynDens 
resembles the FP-tree [16], in that both store overlapping subsets 
in a prefix tree, with inverted lists embedded into the tree structure. 
However, the FP-tree is optimized for static data, and assumes that 
tree nodes can be statically ordered in a way that heuristically de- 
creases tree size; this makes it unsuitable for ENGAGEMENT, where 
tree nodes dynamically change. Moreover, other improvements of 
the FP-tree over a plain prefix tree are not applicable to ENGAGE- 
MENT, as the problems solved are different. 
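The general idea of storing overlapping vertex subsets in a prefix tree, with per-vertex inverted lists pointing into the tree, can be sketched as follows. This is an illustrative toy under assumed names (`Node`, `SubgraphIndex`), using a fixed sorted vertex order; it is not the index actually used by DynDens, whose details are given elsewhere in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


class Node:
    def __init__(self, vertex: Optional[int], parent: Optional["Node"]) -> None:
        self.vertex = vertex
        self.parent = parent
        self.children: Dict[int, "Node"] = {}
        self.is_subgraph = False  # True if the root-to-node path is a stored set

    def path(self) -> Tuple[int, ...]:
        """Vertices on the path from the root down to this node."""
        out: List[int] = []
        node = self
        while node.parent is not None:
            out.append(node.vertex)
            node = node.parent
        return tuple(reversed(out))


class SubgraphIndex:
    def __init__(self) -> None:
        self.root = Node(None, None)
        # Inverted lists: vertex -> tree nodes labelled with that vertex.
        self.by_vertex: Dict[int, List[Node]] = defaultdict(list)

    def insert(self, vertices: Iterable[int]) -> None:
        """Store a vertex set; sets sharing a sorted prefix share tree nodes."""
        node = self.root
        for v in sorted(vertices):
            child = node.children.get(v)
            if child is None:
                child = Node(v, node)
                node.children[v] = child
                self.by_vertex[v].append(child)
            node = child
        node.is_subgraph = True

    def containing(self, vertex: int) -> Set[Tuple[int, ...]]:
        """All stored sets containing `vertex`: each one ends at or below a
        tree node labelled with that vertex."""
        found: Set[Tuple[int, ...]] = set()
        stack = list(self.by_vertex.get(vertex, []))
        while stack:
            node = stack.pop()
            if node.is_subgraph:
                found.add(node.path())
            stack.extend(node.children.values())
        return found


index = SubgraphIndex()
for subgraph in ({1, 2, 3}, {1, 2}, {2, 3}, {1, 3, 4}):
    index.insert(subgraph)
print(sorted(index.containing(2)))
```

The inverted lists let an edge update on vertex 2 reach every stored subgraph that contains it without scanning the whole tree, which is the kind of lookup a streaming setting needs.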

8. CONCLUSIONS 

Motivated by the need to mine important stories and events from 
the social media collective, as they emerge, in this work we exam- 
ine the problem of maintaining dense subgraphs under streaming 
edge weight updates. For a broad definition of graph density, we 
propose the first efficient algorithm, DynDens, which is based on 
novel theoretical results regarding the magnitude of change that a 
single edge weight update can cause. DynDens is highly efficient,
and able to gracefully scale to rapidly evolving datasets, and we 
validate the efficiency and effectiveness of our approach via a thor- 
ough evaluation on real and synthetic datasets. 

Moreover, there are many exciting new directions stemming from 
this work. For example, an important problem in the social media 
space is the timely identification of online communities. While it is
easy to see how ENGAGEMENT can be applied to this domain, its
characteristics are somewhat different from those of real-time story 
identification (e.g. social graphs are frequently directed, communi- 
ties are typically subgraphs of larger cardinality than stories, etc.), 
and it would be interesting to explore how to adapt DynDens to 
the diverse challenges this domain imposes. Another interesting 
technical problem arises when considering the need for adjusting 
the density threshold T during execution, e.g. in order to adapt
to changes in the dataset. We are actively exploring adapting the 
techniques used in DynDens to more efficiently perform this task. 